Decision Trees and Ensemble Methods
DDD: Elements of Statistical Machine Learning & Politics of Data
Ayush Patel
At Azim Premji University, Bhopal
13 Feb, 2026
Did you come prepared?
You have installed R. If not, see this link.
You have installed RStudio, Positron, VS Code, or another IDE. Working through an IDE is recommended.
You have the libraries {tree}, {gbm}, {randomForest}, and {BART} installed.
Learning Goals
What are decision trees?
When to use them?
How do these work?
Application and interpretation.
Improving decision trees using ensemble methods.
Decision Trees
Supervised and non-parametric.
Can be used for classification and regression.
The predictor space is cut into segments, and the mean response of the training observations in a segment is used as the predicted response for test observations that fall in it.
Simple and easy to interpret, but not the most accurate predictor on its own.
Prediction accuracy can be improved with ensemble methods.
Intuition - How it works
Can you identify regions with similar salary ranges?
Model output - Predictor Space
from ISLR
Model output - Tree
from ISLR
Region representation
\(R_1 = \left\{X \mid Years < 4.5\right\}\)
\(R_2 = \left\{X \mid Years \geq 4.5,\ Hits < 117.5\right\}\)
\(R_3 = \left\{X \mid Years \geq 4.5,\ Hits \geq 117.5\right\}\)
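The three regions can also be read as a simple lookup rule. The following is a minimal sketch in base R; the helper name `assign_region` and the three example players are made up for illustration, while the cutpoints 4.5 and 117.5 come from the slide.

```r
# Hypothetical helper: map (Years, Hits) to the region labels used above.
assign_region <- function(years, hits) {
  ifelse(years < 4.5, "R1",
         ifelse(hits < 117.5, "R2", "R3"))
}

# Made-up players, for illustration only
players <- data.frame(Years = c(2, 6, 10), Hits = c(150, 90, 130))
players$region <- assign_region(players$Years, players$Hits)
players
```

Note that Hits does not matter for a player with fewer than 4.5 years: the first split already places them in R1.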
Tree terminology
from ISLR
How should we segment the predictor space?
Segmenting - Theory
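The predictor space of \(p\) variables needs to be segmented into \(J\) distinct, non-overlapping regions \(R_1, \dots, R_J\).
In theory, the regions can be of any shape; in practice, high-dimensional rectangles (boxes) are chosen for computational ease and interpretability.
For every observation that falls in region \(R_j\), we make the same prediction: the mean (regression) or the mode (classification) of the response for the training observations in \(R_j\).
For regression, the regions are chosen to minimize the RSS: \(\sum_{j=1}^{J}{\sum_{i \in R_j}{(y_i - \hat{y}_{R_j})^2}}\), where \(\hat{y}_{R_j}\) is the mean response of the training observations in \(R_j\).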
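To make the objective concrete, here is a small sketch in base R that evaluates the RSS for a given partition. The six responses and their region labels are toy values invented for illustration; only the formula comes from the slide.

```r
# Toy data, made up for illustration
y      <- c(10, 12, 11, 30, 32, 31)
region <- c("R1", "R1", "R1", "R2", "R2", "R2")

# Prediction for every observation = mean response of its region
y_hat <- ave(y, region)   # ave() uses the group mean by default

# Objective: sum over regions of squared deviations from the region mean
rss <- sum((y - y_hat)^2)
rss
```

A different assignment of observations to regions gives a different RSS; the tree-building problem is to find the partition that makes this number small.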
But
It is computationally infeasible to consider every possible cutpoint for every predictor in every possible sequence.
We use high-dimensional rectangles rather than arbitrary shapes for ease of interpretation.
So, use recursive binary splitting
A top-down, greedy approach
Top-down
We begin at the point where all observations are part of the same region (the top of the tree) and successively split the predictor space. Hence the name top-down.
Greedy
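At each step, the best split is chosen for that step alone; we do not look ahead to splits that might lead to a better tree later. A predictor \(X_k\) and a cutpoint \(s\) are chosen so that splitting into \(\left\{X \mid X_k < s\right\}\) and \(\left\{X \mid X_k \geq s\right\}\) gives the lowest RSS.
This is repeated recursively within each resulting region until a stopping criterion is met.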
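A single greedy step can be sketched in a few lines of base R. This is an illustration of the idea, not the internals of {tree}: the helper `best_split` is hypothetical, it handles one numeric predictor only, and the data are made up. Candidate cutpoints are taken as midpoints between consecutive distinct values, which is one common convention.

```r
# Hypothetical helper: best single cutpoint for one numeric predictor x,
# chosen by the lowest RSS after the split.
best_split <- function(x, y) {
  xs <- sort(unique(x))
  # Candidate cutpoints: midpoints between consecutive distinct values
  cuts <- (head(xs, -1) + tail(xs, -1)) / 2
  rss <- sapply(cuts, function(s) {
    left  <- y[x <  s]
    right <- y[x >= s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  list(cut = cuts[which.min(rss)], rss = min(rss))
}

# Toy data, made up for illustration
x <- c(1, 2, 3, 10, 11, 12)
y <- c(5, 6, 5, 20, 21, 19)
best_split(x, y)
```

A full tree would run this search over every predictor, keep the predictor and cutpoint with the lowest RSS, and then repeat the same search inside each of the two resulting regions.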
Fitting Regression Trees
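library(tree)
library(palmerpenguins)  # assumed source of `penguins`; its column names match the output shown
tree(body_mass_g ~ ., data = penguins) -> peng_mass_tree
peng_mass_tree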
node), split, n, deviance, yval
      * denotes terminal node

1) root 333 215300000 4207
  2) species: Adelie,Chinstrap 214 40430000 3715
    4) sex: female 107 8493000 3419 *
    5) sex: male 107 13240000 4010 *
  3) species: Gentoo 119 29670000 5092
    6) sex: female 58 4519000 4680 *
    7) sex: male 61 5884000 5485 *
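One way to read the printout: for a regression tree, the deviance of a node is the residual sum of squares within that node, so comparing a parent with its two children shows how much a split helped. The sketch below redoes that arithmetic in base R for the first split; the numbers are copied from the printed tree (which rounds them), and the variable names are only for illustration.

```r
# Deviances copied from the printed tree (rounded as displayed)
root_dev  <- 215300000   # node 1: root
left_dev  <- 40430000    # node 2: Adelie, Chinstrap
right_dev <- 29670000    # node 3: Gentoo

# Drop in RSS achieved by splitting on species
reduction <- root_dev - (left_dev + right_dev)
reduction

# Share of the total deviance removed by the first split
reduction / root_dev
```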
Fitting Regression Trees
summary(peng_mass_tree)
Regression tree:
tree(formula = body_mass_g ~ ., data = penguins)
Variables actually used in tree construction:
[1] "species" "sex"
Number of terminal nodes: 4
Residual mean deviance: 97680 = 32140000 / 329
Distribution of residuals:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
-760.30 -219.20   15.16    0.00  220.30  815.20
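The residual mean deviance in the summary can be reproduced by hand: it is the total deviance of the terminal nodes divided by the number of observations minus the number of terminal nodes (333 - 4 = 329). The sketch below uses the terminal-node deviances from the printed tree; because those are rounded to four significant figures, the result agrees with the summary only after rounding.

```r
# Terminal-node deviances copied from the printed tree (nodes 4, 5, 6, 7)
leaf_dev <- c(8493000, 13240000, 4519000, 5884000)
n_obs    <- 333
n_leaves <- length(leaf_dev)

total_dev <- sum(leaf_dev)
resid_mean_dev <- total_dev / (n_obs - n_leaves)

# Round to four significant figures to compare with the summary output
signif(total_dev, 4)
signif(resid_mean_dev, 4)
```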
Fitting Regression Trees
plot(peng_mass_tree)
text(peng_mass_tree, pretty = 0)
Fitting Regression Trees
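library(dplyr)  # for mutate(), relocate(), everything()
# predict() without newdata returns the fitted values for the 333 training rows,
# so incomplete rows are dropped to line the data up with them
na.omit(penguins) |>
  mutate(
    pred_mass = predict(peng_mass_tree)
  ) |>
  relocate(body_mass_g, pred_mass, everything())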
# A tibble: 333 × 9
   body_mass_g pred_mass species island    bill_length_mm bill_depth_mm
         <int>     <dbl> <fct>   <fct>              <dbl>         <dbl>
 1        3750     4010. Adelie  Torgersen           39.1          18.7
 2        3800     3419. Adelie  Torgersen           39.5          17.4
 3        3250     3419. Adelie  Torgersen           40.3          18
 4        3450     3419. Adelie  Torgersen           36.7          19.3
 5        3650     4010. Adelie  Torgersen           39.3          20.6
 6        3625     3419. Adelie  Torgersen           38.9          17.8
 7        4675     4010. Adelie  Torgersen           39.2          19.6
 8        3200     3419. Adelie  Torgersen           41.1          17.6
 9        3800     4010. Adelie  Torgersen           38.6          21.2
10        4400     4010. Adelie  Torgersen           34.6          21.1
# ℹ 323 more rows
# ℹ 3 more variables: flipper_length_mm <int>, sex <fct>, year <int>
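Every prediction in the table is one of the terminal-node means, so many penguins share the same predicted mass. As a small illustration, the sketch below retypes the first ten rows shown above (with predictions rounded as displayed) and computes their residuals in base R. Ten rows are not a model assessment; this only shows what a residual looks like here.

```r
# First ten rows copied from the table above (predictions rounded as displayed)
body_mass_g <- c(3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800, 4400)
pred_mass   <- c(4010, 3419, 3419, 3419, 4010, 3419, 4010, 3419, 4010, 4010)

# Only two distinct predictions appear: the Adelie female and Adelie male leaf means
unique(pred_mass)

# Residual = observed minus predicted
resid <- body_mass_g - pred_mass
resid

# Root mean squared error over these ten rows only
sqrt(mean(resid^2))
```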